On the island country of Kronos, increasingly noxious effects on health and farming have been linked to the uncontrolled activity of GAStech, a natural gas operator supported by corrupt government officials. On January 20th, 2014, a corporate meeting is held to celebrate the company's new-found fortune following its initial public offering. However, a series of rare events leads to the disappearance of several employees. The Protectors of Kronos (POK), a social movement organization that has been fighting water contamination and government corruption, is suspected in the disappearance.
As analysts, we were assigned several tasks to identify risks and determine how they could have been mitigated more effectively.
Using data and visual analytics, we evaluate the changing levels of risk to the public and recommend actions for first responders:
Distinguish meaningful event reports from typical chatter and from junk or spam.
Evaluate how the level of risk to the public evolves over the course of the evening. Consider the potential consequences of the situation and the number of people who could be affected, and determine the appropriate locations for first responders.
Describe the differences between dealing with this challenge in 2014 and dealing with it now.
First, we run this first line of code to clear the environment and remove any existing R objects.
This code chunk checks whether the required packages are installed. If they are not, the loop installs them; library() then loads each package into the current working environment.
packages = c('readr','tidytext','data.table','lubridate','ggplot2',
             'caret','dplyr','tidyr','scales','quanteda','textdata',
             'stringr','stringi','reshape2','RColorBrewer','wordcloud',
             'forcats','igraph','ggraph','widyr','clock','knitr','tidyverse',
             'DT','hms','ggiraph','topicmodels','raster','sf','maptools',
             'rgdal','ggmap','sp','mapview','tmap','gifski')
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
First, use read_csv() to import the three CSV files.
read1 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1700-1830.csv",
                  col_types = list(col_character(), col_character(), col_character(),
                                   col_character(), col_double(), col_double(),
                                   col_character()))
read2 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1831-2000.csv",
                  col_types = list(col_character(), col_character(), col_character(),
                                   col_character(), col_double(), col_double(),
                                   col_character()))
read3 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-2001-2131.csv",
                  col_types = list(col_character(), col_character(), col_character(),
                                   col_character(), col_double(), col_double(),
                                   col_character()))
Then use rbind() to combine the three data frames, which share the same format.
df <- rbind.data.frame(read1,read2,read3)
glimpse(df)
Rows: 4,063
Columns: 7
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <chr> "20140123170000", "20140123170000", "~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
The glimpse above shows that date(yyyyMMddHHmmss) is stored as character rather than as a date-time, so we convert it to a date-time field. Because all activities in Mini-Challenge 3 occur on the same day, we also extract a time-only (hms) column without the date.
df$`date(yyyyMMddHHmmss)` <- date_time_parse(df$`date(yyyyMMddHHmmss)`,
                                             zone = "",
                                             format = "%Y%m%d%H%M%S")
df$time <- as_hms(df$`date(yyyyMMddHHmmss)`)
glimpse(df)
Rows: 4,063
Columns: 8
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <dttm> 2014-01-23 17:00:00, 2014-01-23 17:0~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
$ time <time> 17:00:00, 17:00:00, 17:00:00, 17:00:~
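As a quick sanity check of the conversion, the same parsing can be tried on a single made-up timestamp (a sketch assuming the clock and hms packages used above):

```r
# Hypothetical timestamp in the same yyyyMMddHHmmss layout as the data
x <- "20140123170000"

# Parse the 14-digit string into a date-time
dt <- clock::date_time_parse(x, zone = "", format = "%Y%m%d%H%M%S")

# Keep only the time of day, since all events fall on one day
hms::as_hms(dt)
#> 17:00:00
```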
For question 1, we extract the required columns from df into a new data frame. After reading all the data carefully, the terminology of the mbdata (microblog) and ccdata (call-center) records is very different, so we separate the two. Since ccdata consists of police and fire department records, that subset is labeled meaningful.
df1 <- subset(df, select = c("type","author","message"))
df_m <- subset(df1, type == "mbdata")
df_cc <- subset(df1, type == "ccdata")
df_cc$condition <- "meaningful"
DT::datatable(df_cc, filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE,
                             columnDefs = list(list(width = '400px', targets = c(3))),
                             dom = 'Bfrtip',
                             buttons = c('copy', 'csv', 'excel', 'print', 'pdf')))
Junk definition: after reading all the data carefully, I selected the following types of messages.
Authors such as "KronosQuoth" post content related to the three types of events in this challenge (the rally, the fire and building collapse, and the accident and shooting), but rarely provide important information such as locations; the content is mostly emotional and cathartic.
Content that includes the term "#Grammar" is mostly unrelated to the important events in Mini-Challenge 3.
Content containing "RT" simply retweets other people's messages, repeating the same information, and the number of such messages is very large.
junk_authors <- c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
                  "blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
                  "junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
                  "panopticon","dels4realz","eazymoney","cleaningFish")
junk <- df_m %>%
  filter(str_detect(author, paste(junk_authors, collapse = "|")) |
         str_detect(message, "#Grammar|RT"))
Meaningful definition: after reading all the data carefully, I selected the following types of messages.
Information released by official or authoritative authors, such as "POK" and "AbilaPoliceDepartment".
Information released by authors who witnessed the events and published important details, such as "magaMan" and "Sara_Nespola".
Content that includes terms like "fire" and "rally", which is mostly related to the important events in Mini-Challenge 3.
meaningful_authors <- c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
                        "MindOfKronos","Viktor-E","maha_Homeland","anaregents","wordWatcher",
                        "InternationalNews","HomelandIlluminations","NewsOnlineToday",
                        "AbilaPoliceDepartment","KronosStar","magaMan","Sara_Nespola",
                        "protoGuy","SiaradSea","AbilaFire","footfingers","truthforcadau",
                        "truccotrucco","dangermice","trapanitweets","sofitees",
                        "brewvebeenserved","hennyhenhendrix")
meaningful <- df_m %>%
  filter(str_detect(author, paste(meaningful_authors, collapse = "|")) |
         str_detect(message, [2191 chars quoted with '"']))
Meaningless definition: after reading all the data carefully, I selected the following types of messages.
Content that includes terms like "fire" and "rally" that relate to the events but only express emotion and provide no important information.
Content that is unrelated to the events and has its own objectives, such as advertising shops.
This group is obtained by subtracting the other groups from df_m with the anti_join() function.
meaningful <- dplyr::anti_join(meaningful,junk,by = c("type", "author", "message"))
combinedata <- rbind.data.frame(meaningful, junk)
meaningless <- dplyr::anti_join(df_m,combinedata,by = c("type", "author", "message"))
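The subtraction logic of anti_join() can be seen on toy data (a minimal sketch assuming dplyr is loaded; the rows are invented):

```r
library(dplyr)

all_msgs <- tibble(author = c("a", "b", "c"), message = c("m1", "m2", "m3"))
labelled <- tibble(author = c("a", "c"), message = c("m1", "m3"))

# anti_join() keeps the rows of all_msgs that have no match in labelled,
# i.e. "everything minus what is already labelled"
anti_join(all_msgs, labelled, by = c("author", "message"))
#> only the row for author "b" remains
```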
Combine the meaningful, meaningless, and junk data and add a new label column.
junk$condition <- "junk"
meaningful$condition <- "meaningful"
meaningless$condition <- "meaningless"
finalq1 <- rbind.data.frame(meaningful,junk, meaningless)
DT::datatable(finalq1, filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE,
                             columnDefs = list(list(width = '400px', targets = c(3))),
                             dom = 'Bfrtip',
                             buttons = c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the stringr package to remove punctuation, @, #, < and Chinese characters from the messages. The ccdata messages are already clean and require no special handling.
finalq1$message <- str_replace_all(finalq1$message,'[[:punct:]]+', "")
finalq1$message <- str_replace_all(finalq1$message,fixed("@"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("#"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("<"),"")
finalq1$message <- str_replace_all(finalq1$message,"[\u4e00-\u9fa5]+", "")
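Note that the first pattern, [[:punct:]]+, already removes @, # and <, so the fixed() replacements that follow are just a safety net. The effect on a toy message (base R gsub() behaves the same way as str_replace_all() here; the message is invented):

```r
msg <- "RT @AbilaPost: fire at #DancingDolphin <now>! 注意"

# Strip ASCII punctuation (this also covers @, # and <)
msg <- gsub("[[:punct:]]+", "", msg)

# Strip characters in the common CJK Unicode range
msg <- gsub("[\u4e00-\u9fa5]+", "", msg)

msg
#> "RT AbilaPost fire at DancingDolphin now "
```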
tidy_m <- finalq1 %>%
  unnest_tokens(word, message) %>%
  count(condition, word, sort = TRUE)
data(stop_words)
tidy_m <- tidy_m %>%
  anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
tidy_m <- tidy_m %>%
  anti_join(my_stopwords)
tidy_cc <- df_cc %>%
  unnest_tokens(word, message) %>%
  count(word, sort = TRUE)
DT::datatable(tidy_m, filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE,
                             columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom = 'Bfrtip',
                             buttons = c('copy', 'csv', 'excel', 'print', 'pdf')))
The plot below shows the word-frequency distribution for the junk, meaningful, and meaningless groups. The junk messages repeat the same words at a very high rate.
ggplot(tidy_m, aes(n, fill = condition)) +
  geom_histogram(show.legend = FALSE) +
  xlim(0, 100) +
  facet_wrap(~ condition, ncol = 2, scales = "free_y")

The plots below show the top 15 words by count (n) in the junk, meaningful, and meaningless groups.
Junk group
Most are words related to the POK rally, used to vent feelings, or words related to success, along with statements against the government and police: words like "success", "life", and "police".
There are also obvious spam features such as rt and grammar. These messages are very numerous and provide no useful information.
Meaningful group
Words that identify the important people in the events, like "viktore".
Words that describe the important processes of the events, like "standoff".
Words that identify the important locations of the events, like "dolphin" and "dancing".
Meaningless group
tidy_m %>%
  group_by(condition) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = condition)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ condition, ncol = 2, scales = "free")

The plot of ccdata shows the events, with words like "fire" and "traffic".
tidy_cc %>%
slice_max(n,n = 15) %>%
ggplot(aes(x = n,
y= reorder(word,n)))+
geom_col(show.legend = FALSE)+
labs(y = NULL)

The word clouds below lead to conclusions similar to the simple EDA above.
finalq1 %>%
  filter(condition == "meaningful") %>%
  unnest_tokens(word, message) %>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 100))
finalq1 %>%
  filter(condition == "meaningless") %>%
  unnest_tokens(word, message) %>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 100))
finalq1 %>%
  filter(condition == "junk") %>%
  unnest_tokens(word, message) %>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 100))
tidy_cc %>%
  with(wordcloud(word, n, max.words = 100))

m_tf_idf <- tidy_m %>%
  bind_tf_idf(word, condition, n)
m_tf_idf %>%
  arrange(desc(tf_idf))
# A tibble: 4,111 x 6
condition word n tf idf tf_idf
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 junk rt 1000 0.0558 1.10 0.0613
2 junk kronosstar 884 0.0493 0.405 0.0200
3 junk homelandilluminations 183 0.0102 1.10 0.0112
4 junk grammar 157 0.00876 1.10 0.00962
5 junk abilapost 330 0.0184 0.405 0.00746
6 junk rally 260 0.0145 0.405 0.00588
7 meaningless cards 9 0.00521 1.10 0.00572
8 meaningless easycreditkronosmorecredit 9 0.00521 1.10 0.00572
9 meaningless nobanks 9 0.00521 1.10 0.00572
10 meaningful abilapost 70 0.0137 0.405 0.00555
# ... with 4,101 more rows
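These values can be reproduced by hand: tf is a word's share of its group's tokens, and idf = ln(number of groups / number of groups containing the word), so a word confined to one of the three groups gets idf = ln(3) ≈ 1.10, matching the table above. A base-R sketch with invented counts:

```r
n_word_in_group  <- 100   # occurrences of the word in this group (made up)
n_group_tokens   <- 2000  # all tokens in this group (made up)
n_groups         <- 3     # junk, meaningful, meaningless
n_groups_with_it <- 1     # the word appears in only one group

tf  <- n_word_in_group / n_group_tokens   # 0.05
idf <- log(n_groups / n_groups_with_it)   # ln(3) ~ 1.0986
tf * idf
#> 0.05493061
```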
m_tf_idf %>%
group_by(condition) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = condition))+
geom_col(show.legend = FALSE)+
facet_wrap(~condition,ncol = 2,scales = "free")+
labs(x = "word",y = NULL)

meaningful_bigrams <- meaningful %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningful_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 5,775 x 2
bigram n
<chr> <int>
1 viktor e 48
2 of the 42
3 dancing dolphin 41
4 in the 40
5 at the 38
6 abila centralbulletin 30
7 pok rally 28
8 to the 24
9 dr newman 23
10 dolphin fire 20
# ... with 5,765 more rows
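How unnest_tokens() produces these bigrams can be seen on a one-sentence example (a sketch assuming tidytext and dplyr; the sentence is invented):

```r
library(dplyr)
library(tidytext)

toy <- tibble(message = "fire at dancing dolphin apartment")

# Each pair of consecutive words becomes one row
toy %>%
  unnest_tokens(bigram, message, token = "ngrams", n = 2)
#> bigrams: "fire at", "at dancing", "dancing dolphin", "dolphin apartment"
```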
meaningful_separated <- meaningful_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
meaningful_filtered <- meaningful_separated %>%
  filter(!word1 %in% my_stopwords$word) %>%
  filter(!word2 %in% my_stopwords$word)
meaningful_filtered <- meaningful_filtered %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
meaningful_counts <- meaningful_filtered %>%
count(word1, word2, sort = TRUE)
meaningful_graph <- meaningful_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningful_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
junk_bigrams <- junk %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
junk_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 7,338 x 2
bigram n
<chr> <int>
1 pokrally hi 670
2 kronosstar pokrally 598
3 pok rally 234
4 rt homelandilluminations 183
5 rt abilapost 169
6 rally grammar 157
7 rt kronosstar 143
8 if you 126
9 of the 115
10 you can 102
# ... with 7,328 more rows
junk_separated <- junk_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
junk_filtered <- junk_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
junk_counts <- junk_filtered %>%
count(word1, word2, sort = TRUE)
junk_graph <- junk_counts %>%
filter(n > 50) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(junk_graph, layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
meaningless_bigrams <- meaningless %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningless_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 2,051 x 2
bigram n
<chr> <int>
1 badprofiles.kronos tacky 12
2 of the 12
3 viktor e 10
4 abila nobanks 9
5 abila pictures 9
6 cards get 9
7 credit cards 9
8 easy credit 9
9 easycredit.kronos morecredit 9
10 get what 9
# ... with 2,041 more rows
meaningless_separated <- meaningless_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
meaningless_filtered <- meaningless_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
meaningless_counts <- meaningless_filtered %>%
count(word1, word2, sort = TRUE)
meaningless_graph <- meaningless_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningless_graph, layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

m_words_cor <- tidy_m %>%
group_by(condition) %>%
filter(n()>=20) %>%
pairwise_cor(word,condition,sort = TRUE)
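pairwise_cor() computes the phi coefficient between pairs of words based on whether they appear in the same feature value. A toy sketch (assuming widyr and dplyr; the data are invented):

```r
library(dplyr)
library(widyr)

toy <- tibble(
  group = c("g1", "g1", "g2", "g2", "g3"),
  word  = c("fire", "police", "fire", "police", "rally")
)

# "fire" and "police" always co-occur across groups, so their
# correlation is 1; "rally" never co-occurs with them
pairwise_cor(toy, word, group, sort = TRUE)
```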
Repeating the process above, we rebuild the full meaningful dataset, this time keeping all columns.
q2_m <- subset(df, type == "mbdata")
q2_cc <- subset(df,type == "ccdata")
q2_junk_authors <- c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
                     "blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
                     "junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
                     "panopticon","dels4realz","eazymoney","cleaningFish")
q2_junk <- q2_m %>%
  filter(str_detect(author, paste(q2_junk_authors, collapse = "|")) |
         str_detect(message, "#Grammar|RT"))
q2_meaningful_authors <- c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
                           "MindOfKronos","Viktor-E","maha_Homeland","anaregents","wordWatcher",
                           "InternationalNews","HomelandIlluminations","NewsOnlineToday",
                           "AbilaPoliceDepartment","KronosStar","magaMan","Sara_Nespola",
                           "protoGuy","SiaradSea","AbilaFire","footfingers","truthforcadau",
                           "truccotrucco","dangermice","trapanitweets","sofitees",
                           "brewvebeenserved","hennyhenhendrix")
q2_meaningful <- q2_m %>%
  filter(str_detect(author, paste(q2_meaningful_authors, collapse = "|")) |
         str_detect(message, [2191 chars quoted with '"']))
q2_meaningful <- dplyr::anti_join(q2_meaningful,q2_junk)
Rally-related keywords: pokrally, Abila City Park, Stand Up Speak Up, Sylvia Marek, Dr. Audrey McConnell Newman, Professor Lorenzo Di Stefano, Lucio Jakab, and Viktor-E.
q2_rally_m <- q2_meaningful %>%
  filter(str_detect(message, "pokrally|Abila City Park|Stand Up Speak Up|Sylvia Marek|Audrey McConnell Newman|Professor Lorenzo Di Stefano|Lucio Jakab|Viktor-E|Sylvia|Marek|Newman|Stefano|Di Stefano|Lucio|Jakab"))
q2_rally_cc <- q2_cc %>%
filter(str_detect(message,"ABILA CITY PARK|CROWD"))
q2_rally <- rbind(q2_rally_m,q2_rally_cc)
DT::datatable(q2_rally,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_rally$message <- str_replace_all(q2_rally$message,'[[:punct:]]+', "")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("@"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("#"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("<"),"")
q2_rally$message <- str_replace_all(q2_rally$message,"[\u4e00-\u9fa5]+", "")
q2_rally_tidy <- q2_rally %>%
unnest_tokens(word, message)
data(stop_words)
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_rally_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_fire_m <- q2_meaningful %>%
filter(str_detect(message,"fire|dolphin|dancing|building|apartment|Madeg|dispatch|afd|floor|floors|fireman|firefighters|firefighter|evacuate|evacuated|evacuating|evacuation|trapped|injuries|scene|trapped|collapsed|blaze|escalated"))
q2_fire_cc <- q2_cc %>%
filter(str_detect(message,"Fire|Crime|scene"))
q2_fire <- rbind(q2_fire_m,q2_fire_cc)
DT::datatable(q2_fire,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_fire$message <- str_replace_all(q2_fire$message,'[[:punct:]]+', "")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("@"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("#"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("<"),"")
q2_fire$message <- str_replace_all(q2_fire$message,"[\u4e00-\u9fa5]+", "")
q2_fire_tidy <- q2_fire %>%
unnest_tokens(word, message)
data(stop_words)
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_fire_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_accident_m <- q2_meaningful %>%
  filter(str_detect(message, "shooting|standoff|hostage|swat|negotiation|fight|arrest|hit|van|driver|bicyclist|accident|incident|bike|L829|pursuit|gun|shot|kill|dead|yelling|screaming|negotiating|negotiator|caught|over|end|shoot|chasing"))
q2_accident_cc <- q2_cc %>%
filter(str_detect(message,"van|pursuit|accident|vandalism|swat"))
q2_accident <- rbind(q2_accident_m,q2_accident_cc)
DT::datatable(q2_accident,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_accident$message <- str_replace_all(q2_accident$message,'[[:punct:]]+', "")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("@"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("#"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("<"),"")
q2_accident$message <- str_replace_all(q2_accident$message,"[\u4e00-\u9fa5]+", "")
q2_accident_tidy <- q2_accident %>%
unnest_tokens(word, message)
data(stop_words)
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_accident_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_rally_tidy%>%
filter(str_detect(word,"rally")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="rally")
q2_rally_tidy%>%
filter(str_detect(word,"pok")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="pok")
q2_rally_tidy%>%
filter(str_detect(word,"sylvia|marek")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="sylvia|marek")
q2_rally_tidy%>%
filter(str_detect(word,"audrey|mcConnell|newman")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="audrey|mcConnell|newman")
q2_rally_tidy%>%
filter(str_detect(word,"lucio|jakab")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="lucio|jakab")
q2_rally_tidy%>%
filter(str_detect(word,"lorenzo|di|stefano")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="lorenzo|di|stefano")
q2_rally_tidy%>%
filter(str_detect(word,"viktor")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="viktor")

q2_fire_tidy%>%
filter(str_detect(word,"fire|dolphin|dancing|building|apartment")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = NULL)
q2_fire_tidy%>%
filter(str_detect(word,"floor|floors|upper|resident|trapped")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"afd|police|cop|cops|fireman|firefighter")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"ambulance|injury|injuries|evacuated|evacuating|evacuation|evacuate")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"control")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = "control")
q2_fire_tidy%>%
filter(str_detect(word,"collapsed|blaze|escalated|explosion")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)

q2_accident_tidy%>%
filter(str_detect(word,"hit|run|van|bicyclist|driver|incident|accident|bike|pursuit")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = NULL)
q2_accident_tidy%>%
filter(str_detect(word,"gun|shoot|shot|hostage")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"killed|dead")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"standoff|negotiating|negotiate|negotiator|negotiation")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"yelling|screaming|chasing|standoff")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = NULL)
q2_accident_tidy%>%
filter(str_detect(word,"end|over|caught|rescued")) %>%
ggplot(aes(x = time)) +
geom_histogram()+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)

q3_lda <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q3_lda <- na.omit(q3_lda)
tidy_q3 <- q3_lda %>%
unnest_tokens(word,message)
q3_wordcount <- tidy_q3 %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q3_wordcount <- q3_wordcount %>%
anti_join(my_stopwords) %>%
count(author,word,sort = TRUE)
q3_wordcount
# A tibble: 453 x 3
author word n
<chr> <chr> <int>
1 footfingers pok 21
2 truccotrucco standoff 21
3 truccotrucco im 12
4 truccotrucco shooting 12
5 truccotrucco gelatogalore 9
6 truccotrucco van 9
7 truthforcadau viktor 8
8 footfingers kronos 7
9 dangermice abilafire 6
10 footfingers people 6
# ... with 443 more rows
q3_dtm <- q3_wordcount %>%
cast_dfm(author,word,n)
bgmap <- raster("F:/visual/assignment and project/MC3/MC3/Geospatial/MC2-tourist.tif")
abila_st <- st_read(dsn = "F:/visual/assignment and project/MC3/MC3/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source
`F:\visual\assignment and project\MC3\MC3\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
q3_gps <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_m <- na.omit(q3_gps)
gps_sf <- st_as_sf(gps_m,
coords = c("longitude","latitude"),
crs = 4326)
gps_point <- gps_sf %>%
group_by(message) %>%
summarize(m = time,
do_union = FALSE) %>%
st_cast("MULTIPOINT")
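The coordinate-to-geometry step can be checked in isolation (a sketch assuming sf; the two points are invented but lie inside Abila's bounding box shown earlier):

```r
library(sf)

pts <- data.frame(
  message   = c("fire reported", "more smoke"),
  longitude = c(24.86, 24.88),
  latitude  = c(36.06, 36.07)
)

# Turn the longitude/latitude columns into POINT geometries in WGS 84
pts_sf <- st_as_sf(pts, coords = c("longitude", "latitude"), crs = 4326)

# Each row is now a POINT; st_bbox() recovers the coordinate range
st_bbox(pts_sf)
```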
q3_fire_gps <- subset(q2_fire_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
gps_fire_m <- na.omit(q3_fire_gps)
gps_fire_sf <- st_as_sf(gps_fire_m,
coords = c("longitude","latitude"),
crs = 4326)
gps_fire_sf
Simple feature collection with 15 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.856 ymin: 36.059 xmax: 24.894 ymax: 36.059
Geodetic CRS: WGS 84
# A tibble: 15 x 6
type `date(yyyyMMddHHmms~ author message time
* <chr> <dttm> <chr> <chr> <time>
1 mbdata 2014-01-23 18:42:10 danger~ "i think the dancin d~ 18:42:10
2 mbdata 2014-01-23 18:55:01 danger~ "Police are flooding ~ 18:55:01
3 mbdata 2014-01-23 18:55:05 danger~ "Oh man they are evac~ 18:55:05
4 mbdata 2014-01-23 19:00:00 danger~ "Someone just got res~ 19:00:00
5 mbdata 2014-01-23 19:16:07 danger~ "Wow - more police #a~ 19:16:07
6 mbdata 2014-01-23 19:20:00 danger~ "More fire trucks are~ 19:20:00
7 mbdata 2014-01-23 19:22:00 danger~ "They are announcing ~ 19:22:00
8 mbdata 2014-01-23 19:27:00 danger~ "They are getting rea~ 19:27:00
9 mbdata 2014-01-23 19:41:33 trucco~ "why didnt i stay at ~ 19:41:33
10 mbdata 2014-01-23 19:45:11 trucco~ "i think were trapped~ 19:45:11
11 mbdata 2014-01-23 20:00:00 danger~ "They just pulled a f~ 20:00:00
12 mbdata 2014-01-23 20:05:00 danger~ "They just put him in~ 20:05:00
13 mbdata 2014-01-23 20:05:00 sofite~ "#APD is evacuating u~ 20:05:00
14 mbdata 2014-01-23 20:34:00 trucco~ "arent swat suppposed~ 20:34:00
15 mbdata 2014-01-23 21:30:00 danger~ "OMG was that an expl~ 21:30:00
# ... with 1 more variable: geometry <POINT [°]>
gps_fire_point <- gps_fire_sf %>%
group_by(message) %>%
summarize(m = `time`,
do_union = FALSE) %>%
st_cast("MULTIPOINT")
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_point) +
tm_dots(col = "red")
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_fire_point) +
tm_dots(col = "red")
q3_author_lda <- LDA(q3_dtm,k = 3, control = list(seed = 1234))
q3_topics <- tidy(q3_author_lda,matrix = "beta")
q3_topics
# A tibble: 1,092 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.37e- 1
2 2 pok 2.64e-82
3 3 pok 1.84e- 2
4 1 standoff 1.25e-81
5 2 standoff 2.41e- 4
6 3 standoff 7.63e- 2
7 1 im 3.95e-82
8 2 im 5.06e- 5
9 3 im 4.28e- 2
10 1 shooting 1.24e-87
# ... with 1,082 more rows
q3_topics %>%
group_by(topic) %>%
top_n(10,beta) %>%
ungroup() %>%
arrange(topic,-beta) %>%
ggplot(aes(beta,term,fill = topic))+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")
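For interpretation: each beta is the probability of a word within a topic, so the betas of one topic sum to 1. A toy check in the same shape as q3_topics (a sketch assuming dplyr; the numbers are invented):

```r
library(dplyr)

toy_beta <- tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("pok", "fire", "van", "pok", "fire", "van"),
  beta  = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6)
)

# Within each topic the word probabilities sum to 1
toy_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))
#> both totals are 1
```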
